ABSTRACT

Epigenetic regulation is dynamic and cell-type dependent. The
recently available epigenomic data in multiple cell types provide
an unprecedented opportunity for a comparative study of the
epigenetic landscape. We developed a machine-learning method
called ChroModule to annotate the epigenetic states in eight
ENCyclopedia Of DNA Elements cell types. The trained model
successfully captured the characteristic histone-modification
patterns associated with regulatory elements, such as promoters
and enhancers, and showed superior performance in identifying
enhancers compared with state-of-the-art methods. In addition,
given the fixed number of epigenetic states in the model,
ChroModule allows straightforward illustration of epigenetic
variability in multiple cell types. Using this feature, we found
that invariable and variable epigenetic states across cell types
correspond to housekeeping functions and stimulus response,
respectively. In particular, we observed that enhancers, but not
the other regulatory elements, dictate cell specificity, as
similar cell types share common enhancers, and cell-type-specific
enhancers are often bound by transcription factors playing
critical roles in that cell type. More interestingly, we found
that some genomic regions are dormant in one cell type but primed
to become active in other cell types. These observations highlight
the usefulness of ChroModule in comparative analysis and
interpretation of multiple epigenomes.



INTRODUCTION 

Identifying cell-type-specific functional regions is an 
important step to understand the regulatory mechanisms 
underlying cell-type-specific gene expression. Histone 
modifications play critical roles in transcriptional regula- 
tion (1), and their patterns at enhancers often manifest 
the cell-type specificity (2,3). With the fast advancement 
of the next-generation sequencing technology, we have 
seen explosive accumulation of epigenomic data (3-8), par- 
ticularly those generated in many different cell types by the 
ENCyclopedia Of DNA Elements (ENCODE) (9,10) and 
the NIH Epigenomics Roadmap consortium (11). 

The availability of cell-type-specific epigenomic data 
provides a unique opportunity for genome-wide identifi- 
cation of regulatory regions (2,12-14) or transcription 
factor (TF)-binding sites (15-17), which in turn helped 
to predict cell-specific gene expression (18-20). Several 
computational methods have been developed to annotate 
epigenomic states using unsupervised learning methods 
(21-23). For example, Ernst and Kellis (21) used an 
unsupervised hidden Markov model (HMM) called 
ChromHMM to define 41 chromatin states using 49 
histone marks. These 41 chromatin states were then 
annotated and grouped into five categories based on the 
enrichment of known functional sites, such as promoters 
or DNaseI hypersensitivity sites (DHSs). This approach
provides a useful annotation of epigenomic states in the
genome, but it also has limitations: for example, the binary
representation of data from chromatin immunoprecipitation
followed by sequencing (ChIP-seq) experiments may not
optimally capture spatial patterns of the epigenomic states,
and the number of HMM states needs to be adjusted based on
the number of histone marks. Given the deluge of epigenomic
data becoming available in various cell types, there is an
urgent need for comparative analysis to reveal the epigenomic
landscape underlying the dynamics of transcriptional regulation.

We present here a novel supervised learning method called
ChroModule that is based on an HMM with a modular structure to
annotate epigenomic states in the human genome, and that is
complementary to the unsupervised learning methods. Inspired by
the study of Filion et al. (24), which found only five major
chromatin states in Drosophila, we chose to train the HMM using
six modules to annotate genomic regions of five categories:
promoter (forward and backward), enhancer, transcribed region,
repressed region and background. Similar to the previous studies
(12,15,22), ChroModule exploits mixtures of Gaussians to model
the spatial patterns of the epigenomic data, which is crucial to
capture open chromatin regions for potential TF binding. The
probabilistic mixtures of Gaussians also flexibly represent the
shapes of epigenomic signals without pre-selecting the number of
HMM states. Once the model is trained, it can be applied to any
other data set containing the same types of epigenomic data, such
as histone modification and chromatin accessibility data, and thus
allows direct comparison of epigenomic states across cell types or
cellular conditions. We illustrated this feature of ChroModule by
training it in one cell type (Huvec) and annotating the other
seven cell types without re-training. The predicted
promoters/enhancers showed significant overlap with the ChIP-seq
peaks of 58 TFs and p300-binding sites, which suggests that
ChroModule captures functional regions of the genome.

The annotated regulatory regions in eight cell types 
provided an opportunity to comparatively analyse 
epigenomic states. We proposed an epigenomic variation 
score (EVS) to measure the variation of the epigenomic 
state across cell types. We observed that epigenetically 
invariable regions mark fundamental functions in a cell, 
whereas variable regions were enriched with genes related 
to cell signalling and response to stimuli. Comparison of 
active enhancers across the eight cell types identified 
cell-type-specific enhancers, which show distinct functions 
as well as putative binding sites of TFs crucial to the
corresponding cell type. We also found that similar cell types
share more common enhancers than dissimilar ones, which is
consistent with the concept that similar cell types reside close
to each other in the epigenetic landscape. Interestingly, we
identified cell-type-specific regulators by comparing the state
representation across cell types, which was further validated by
the ChIP-seq peaks of the TFs.

MATERIALS AND METHODS 

Data 

We used the data from the ENCODE project
(http://genome.ucsc.edu/ENCODE/) and collected the common
markers in eight cell types: GM12878 (lymphoblastoid), H1 (human
embryonic stem cells), Hmec (human mammary epithelial cells),
Hsmm (normal human skeletal muscle myoblasts), Huvec (human
umbilical vein endothelial cells), K562 (leukaemia), Nhek (normal
human epidermal keratinocytes) and Nhlf (normal human lung
fibroblasts). The chromatin marks included H3K4me1/2/3, H3K9ac,
H4K20me1, H3K27ac, H3K27me3 and H3K36me3. H3K9me1 was not
included because it was not available in all the cell types.
Additionally, we included chromatin accessibility data [DHSs or
formaldehyde-assisted isolation of regulatory elements followed
by sequencing (FAIRE-seq) data] (Supplementary Table S1).



The ChroModule model 

ChroModule is composed of six modules: promoter 
(forward and backward), enhancer, transcribed, repressed 
and background module. ChroModule was trained on 
Huvec, and each module was trained independently. The 
HMM in each module has a left-right structure that is 
widely used in signal processing, such as speech recogni- 
tion, and has been proved to be effective in capturing 
temporal patterns (25). We also chose a mixture of Gaussians to
characterize the shapes of histone modifications because it
provides a flexible model to represent the variable profiles of
the sequencing reads. Compared with methods that discretize the
reads into a limited number of states (21), a mixture of
Gaussians is able to model a broad range of variability, which is
crucial for handling the noise of the sequencing data (15,22).
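
As a minimal sketch of the emission model described above (not the
authors' implementation), the following Python code evaluates the
density of a Gaussian mixture for a single HMM state and a single
mark. The component weights, means and standard deviations are
hypothetical values chosen for illustration only, and modelling one
mark at a time is a simplifying assumption.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weights, means, stds):
    """Emission density of one HMM state modelled as a Gaussian mixture."""
    return sum(w * gaussian_pdf(x, m, s)
               for w, m, s in zip(weights, means, stds))


# Hypothetical two-component mixture for one state and one histone mark:
# most bins carry a low signal, a minority carry a broad, higher signal.
weights = [0.7, 0.3]
means = [1.0, 6.0]
stds = [0.5, 2.0]

low_signal = mixture_pdf(1.0, weights, means, stds)
high_signal = mixture_pdf(6.0, weights, means, stds)
print(low_signal, high_signal)
```

Because the read signal enters the density directly, no cut-off for
discretization has to be chosen; the price is that the mixture
parameters must be estimated during training.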

In the previous work, we evaluated the impact of the number of
HMM states and the number of Gaussians in the mixture (the
Gaussians are not tied) on the prediction performance of the
model (12,15). We found that HMMs with >3 states and >2 Gaussians
performed much better than HMMs with fewer states and Gaussians.
This is because a sufficient number of states/Gaussians can
effectively capture the spatial pattern of histone data, such as
the bimodal pattern of H3K4me3 at promoters and H3K4me1 at
enhancers (15). In ChroModule, we chose to use a five-state HMM
with five Gaussians for the promoter and enhancer modules because
the five-state model could detect not only the strong signals
around the peaks but also the weak signals at the flanking
regions of the peak (Supplementary Figure S1).

Supplementary Figure S1 shows the heatmap of histone
modifications at promoters and enhancers, as well as the emission
probabilities of the corresponding HMMs. The bimodal patterns of
epigenomic signals in these regions were well represented by the
emission probability distributions in the five HMM states. For
example, the fourth state in the promoter HMM module modelled the
open chromatin region, as shown by the high probability at small
read counts in H3K4me3 and H3K9ac, and it was flanked by two
shoulder peaks represented by the third and fifth states;
similarly, the HMM emission probabilities in the enhancer HMM
module showed clear depletion of H3K4me1 and other histone marks
in the third state, whereas the second and fourth states showed
higher emission probabilities to represent shoulder peaks, and
the first and fifth states represent the flanking weak signals.
It is worth noting that the fine resolution of the histone
profiles captured by the five-state HMM can greatly facilitate
further analyses.
We used a one-state HMM to model the transcribed (marked by
H3K36me3) and repressed (marked by H3K27me3) regions and the
background, which is equivalent to those methods using a
two-state HMM to identify the enrichment of these marks (26-28).
Supplementary Table S2 summarizes the parameters we chose.

Bioinformatics tools used in the analyses 

Homer (29) was used for de novo motif finding, and the 
found motifs were compared with the known motifs in 
human and mouse documented in the TRANSFAC (30), 
JASPAR (31) and Uniprobe (32) databases. We used 
DAVID (33) to perform gene ontology analysis. 

RESULTS 

ChroModule captures spatial pattern of the epigenomic 
data in the regulatory regions 

ChroModule is composed of HMM modules, each of which has a
left-right structure to capture the spatial patterns of the
epigenomic signals. Each module uses a mixture of Gaussians to
characterize the shapes of histone modifications because it
provides a flexible model to represent the variable profiles of
the sequencing reads [see detailed discussion in 'Materials and
Methods' section and (12,15,22)]. The data used in this study
included eight histone marks (H3K4me1/2/3, H3K9ac, H3K27ac,
H3K27me3, H3K36me3 and H4K20me1) and chromatin accessibility
data (DHSs) (Supplementary Table S1). As Filion et al. (24) found
only five distinct chromatin states, we trained six modules on
five categories of annotated regions (forward/backward promoter,
enhancer, repressed region, transcribed region and background).
Initially, each module was trained separately using the
Baum-Welch algorithm (34). We then linked all modules to
construct the final model in which the transition probabilities
were learned from the data (Figure 1A). This modular design
allows flexible representation of functional states and precise
training of HMMs.

Supplementary Figure S1 shows the heatmap of histone
modifications at promoters and enhancers, as well as the emission
probabilities of the corresponding HMMs. The bimodal patterns of
epigenomic signals in these regions are well represented by the
emission probability distributions. The fourth and the third
states in the promoter and enhancer HMM modules, respectively,
characterize the open chromatin regions, as illustrated by the
depletion of histone signals such as H3K4me3 in promoters
(Figure 1B) and H3K4me1 in enhancers (Supplementary Figure S1).
Such a fine resolution of the histone profiles captured by
ChroModule can greatly facilitate further analyses. For example,
when checking enriched motifs in the enhancers, one can focus on
the open chromatin regions decoded as the third state in the
enhancer module, which would significantly narrow down the search
space for motif finding (Figure 4).

We used a one-state HMM to model the transcribed (marked by
H3K36me3) or repressed (marked by H3K27me3) regions, which is
equivalent to those methods aiming to identify the enrichment of
these marks (26-28). H3K36me3 and H3K27me3 are known marks of
transcribed and repressed regions (35), respectively, and were
used in the previous study for annotating these regions (36).
Even though these two regions are not well defined by only one
characteristic mark, it is important to distinguish them from
promoters or enhancers, which is especially critical for
calculating the EVS to compare epigenetic states across cell
types.

ChroModule is a supervised learning model, and we selected the
training data that represent the most probable loci belonging to
each category (Supplementary Table S2). Because active promoters
and enhancers are associated with strong H3K4me3 and H3K4me1/2
marks, respectively (14), we chose TSSs with the highest H3K4me3
to train the promoter module and the strongest distal DHS peaks
that are also associated with high H3K4me1/2 and low H3K4me3 to
train the enhancer module. For the transcribed and repressed
regions, we selected the top 1000 exons in chromosome 1 with high
H3K36me3 and H3K27me3 signals (>2 normalized read counts),
respectively. We took the entire chromosome 1 to train the
background module. We trained ChroModule in Huvec (see
Supplementary Methods for details) and applied the trained model
to annotate the other cell types. We used the Viterbi algorithm
(34) to assign HMM states to each 100-bp bin (Figure 1C shows the
ChroModule annotation based on epigenome data in K562 and
Figure 1D for all cell types).
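
The decoding step can be illustrated with a small Viterbi sketch in
Python. This is an illustration under stated assumptions, not the
ChroModule code: the three-state left-right model, its transition
probabilities and the single-Gaussian emissions (means 0, 5, 0 with
unit variance) are hypothetical, and each state may only stay or
move one step to the right, which is a special case of the
left-right structure.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def make_gaussian_log_emit(means, std=1.0):
    """Return log_emit(state, x) for one Gaussian per state (toy emissions)."""
    def log_emit(state, x):
        z = (x - means[state]) / std
        return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))
    return log_emit


def viterbi(obs, log_start, log_trans, log_emit):
    """Most probable state path for a sequence of per-bin observations."""
    n_states = len(log_start)
    score = [log_start[s] + log_emit(s, obs[0]) for s in range(n_states)]
    back = []
    for x in obs[1:]:
        prev, score, pointers = score, [], []
        for s in range(n_states):
            # Best predecessor of state s at this bin.
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s] + log_emit(s, x))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical left-right model: flank -> peak -> flank.
start = [safe_log(p) for p in (1.0, 0.0, 0.0)]
trans = [[safe_log(p) for p in row]
         for row in ((0.5, 0.5, 0.0), (0.0, 0.5, 0.5), (0.0, 0.0, 1.0))]
emit = make_gaussian_log_emit(means=[0.0, 5.0, 0.0])

signal = [0.1, 0.2, 5.1, 4.9, 0.0, 0.3]   # one value per 100-bp bin
path = viterbi(signal, start, trans, emit)
print(path)  # [0, 0, 1, 1, 2, 2]
```

Working in log space avoids numerical underflow on long chromosomes,
and the zero-probability transitions (stored as -inf) are what
enforce the left-to-right order of the states.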

Interestingly, ChroModule showed flexibility to capture various
types of spatial patterns, such as both uni- and bi-modal
patterns: uni-modal enhancers were represented by a sequence of
states without visiting the third state (Supplementary Figure
S2A), in contrast to bi-modal enhancers, which were represented
with at least one third state, as well as second and fourth
states. As uni-modal enhancers were observed at the binding loci
of androgen receptor (17,37) because of the dynamic nucleosome
positioning, the ability of a single model to capture divergent
spatial patterns of histone modifications is a unique feature of
ChroModule, which represents diverse chromatin states in a
general way. Indeed, when clustering predicted enhancers in Hmec,
there are distinct sub-classes of enhancers with diverse
combinations of histone modifications (Supplementary Figure S3).
We investigated the expression levels of the nearest genes of the
bi-modal and uni-modal enhancers (Supplementary Figure S2B) and
did not observe a statistically significant difference (P = 0.4).
This observation is not unexpected, as a previous study showed
that dynamic changes in nucleosome occupancy are not predictive
of gene expression (38). There can be several possible reasons,
including that genes are regulated by more than one enhancer, and
their expression is thus not tightly correlated with the
nucleosome dynamics of one enhancer.

Genome-wide annotation using ChroModule 

We observed that enhancers are distributed more broadly than
promoters in the genome. ChroModule identified 38 214 (ranging
from 21 140 in K562 to 27 237 in GM12878) non-overlapping
promoter blocks in the eight cell types spanning 89 Mb, compared
with 199 200 (ranging from 37 328 in K562 to 66 419 in Nhlf)
enhancer blocks spanning 260 Mb of the human genome
(Supplementary Table S3). Indeed, a similar number of predicted
promoters was found across cell types, but the number of
enhancers varied significantly. We also noticed that the majority
of the human genome was unlabelled (Supplementary Figure S5),
often because of insufficient sequencing reads available.

Figure 1. (A) The structure of ChroModule. There are six modules in ChroModule: forward promoter, backward promoter, enhancer,
H3K36me3-enriched region (transcribed region), H3K27me3-enriched region (repressed region) and background. Each module has a left-right
structure, i.e. each state transits to itself or the states located to its right (12,15). (B) Emission probabilities of the five-state HMM for promoter.
The fourth state represents the open chromatin region of depleted H3K4me1/2/3 and enriched DNaseI signals. (C) Example ChroModule annotation
and the epigenomic data in the K562 cells. (D) Example ChroModule annotation of the eight cell types. The probability of each HMM state and
EVS are shown in STAR browser (http://wanglab.ucsd.edu/star/browser) for each cell type.

The annotation in the eight ENCODE cell types allowed
comparative visualization of the epigenetic states. An example
region is shown in Figure 1C and D: the promoter of SLC25A23
shows invariable epigenetic states, in contrast to the variable
CRB3 promoter. SLC25A23 is a calcium-dependent mitochondrial
solute carrier (39), and it is not surprising that it plays roles
in many cell types. Crumbs protein homolog 3 (CRB3) functions in
epithelial cell polarity (39), consistent with the active histone
marks in epithelial (Hmec) and epidermal (Nhek) cells.

ChroModule annotates the genome with
high performance

To assess the quality of the ChroModule annotation, we evaluated
the annotated promoters and enhancers. The promoter annotations
in the eight cell types showed consistently satisfactory
accuracy, as ~60% of the RefSeq genes were predicted to have
active promoters with a false-positive rate <1% (Figure 2A). The
remaining 40% of RefSeq promoters are missed mainly because they
lack H3K4me3, an epigenetic mark for active promoters. This
performance is similar to that of previous studies using the
unsupervised learning method ChromHMM (21).

Evaluating the accuracy of the predicted enhancers is challenging
because of the lack of a gold standard of enhancers. As the
binding of the transcriptional co-factor p300 or any TF often
indicates the location of enhancers, we collected the
p300-binding sites that are distal (>2.5 kb) from any annotated
RefSeq transcription start site (TSS) in H1, K562 and GM12878
cells, as well as 58 TF ChIP-seq experiments (Supplementary
Table S1) that determine the binding sites of these TFs
(TF-binding sites) in GM12878 and K562 cells. We then evaluated
the overlap between the predicted enhancers and p300- or
TF-binding sites (Figure 2 and Table 1). As a comparison, we also
evaluated the performance of another HMM-based method, ChromHMM
(21,36), using the same criteria. In contrast to ChroModule,
ChromHMM is an unsupervised learning method and discretizes
histone modification reads to binary states (presence or
absence). We found that ChroModule consistently outperformed
ChromHMM (Table 1, Figure 2 and Supplementary Figure S6) in
predicting enhancers. For example, ~76% and 83% of all predicted
enhancers in GM12878 and K562 cells, respectively, overlap with
TF ChIP-seq peaks that are at least 2.5 kb away from promoters
(P < 10^-130), compared with 68% (42%) and 68% (44%) for the
strong (all) enhancers predicted by ChromHMM (Supplementary
Table S4). It is also worth noting that enhancers with a
relatively large open chromatin region (visiting the third state
of the enhancer HMM at least twice) flanked by shoulder peaks
(the second and fourth states of the enhancer HMM) have greater
overlaps with DHSs (Supplementary Table S3), which further
illustrates the advantage of capturing the fine spatial pattern
of the chromatin modifications.

Figure 2. Evaluation of the ChroModule performance on (A) promoters (assessed using RefSeq TSSs). ChromHMM results (promoters and strong
promoters) were obtained from ENCODE (36). (B) Assessment of the enhancers predicted by ChroModule and ChromHMM using p300-binding
sites that are distal (>2.5 kb) from RefSeq TSSs. ChroModule outperformed ChromHMM in all the cell types. ChromHMM results (enhancer, strong
enhancer) were downloaded from the study of Ernst et al. (36). Supplementary Figure S6 has comparison in H1, K562 and GM12878.
(C) Assessment of the enhancers predicted by ChroModule and ChromHMM using TF-binding sites in GM12878 and K562 cells. (D) The
comparison of ChroModule models independently trained in Huvec and GM12878 (V2). Receiver operating characteristic (ROC) curves were
generated using RefSeq promoters to evaluate the promoter prediction.

Table 1. Performance of ChroModule and ChromHMM on predicting promoters and enhancers evaluated using RefSeq promoters
and distal p300-binding sites, respectively

To investigate the robustness of ChroModule, we trained the
model using the epigenomic data in GM12878 instead of Huvec and
tested it on two other cell lines: K562 and H1 (Figure 2D).
Regardless of how the models were trained, ChroModule showed
comparable performance. As the robustness of a model is crucial
to annotating epigenomes, this feature makes ChroModule a
powerful tool for analysing epigenomic data in diverse cell
types.

We also checked the portion of the predicted promoters and
enhancers that overlap with DHSs measured in the same cell type
and in different cell types. We found that, although a majority
of the predicted promoters and enhancers overlap with DHSs
(FAIRE-seq data in Hsmm) in the same cell type, there is a
significant increase in the overlap percentage (from ~70 to >95%
of predicted promoters/enhancers) when considering open chromatin
regions in all eight cell types (Supplementary Figure S4).
Although the mechanism underlying this observation is unclear,
these genomic regions may be dormant in one cell type (no open
chromatin) but primed to become active (open chromatin) in other
cell types.
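
The comparison above amounts to computing an overlap fraction
twice, once against the DHS peaks of the same cell type and once
against the pooled peaks of all cell types. A minimal Python sketch
follows; the coordinates and the `dhs_by_cell` dictionary are
hypothetical toy data, and intervals are assumed to be half-open
(chromosome, start, end) tuples.

```python
def overlaps(a, b):
    """True if two half-open intervals (chrom, start, end) share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def overlap_fraction(regions, peaks):
    """Fraction of regions that overlap at least one peak."""
    if not regions:
        return 0.0
    hits = sum(1 for r in regions if any(overlaps(r, p) for p in peaks))
    return hits / len(regions)


# Hypothetical predicted enhancers in one cell type (here called Huvec).
predicted = [("chr1", 100, 600), ("chr1", 2000, 2500), ("chr2", 300, 800)]

# Hypothetical DHS peaks per cell type.
dhs_by_cell = {
    "Huvec": [("chr1", 550, 700)],
    "K562": [("chr1", 2400, 2600)],
    "Nhek": [("chr2", 900, 1000)],
}

same_cell = overlap_fraction(predicted, dhs_by_cell["Huvec"])
pooled = [p for peaks in dhs_by_cell.values() for p in peaks]
all_cells = overlap_fraction(predicted, pooled)
print(same_cell, all_cells)
```

The quadratic scan is adequate for a toy example; for genome-scale
peak sets a sorted sweep or an interval tree would be the more
practical choice.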

Epigenomic variation score 

To quantitatively define the variation of epigenetic states
across the eight cell types, we computed the entropy of
ChroModule labels in each 100-bp bin as the EVS,
$\mathrm{EVS} = -\sum_i P_i \ln(P_i)$, where $P_i$ is the
occurrence frequency of the promoter, enhancer, transcribed
region, repressed region or background label in all cell types
(Figure 1D). We observed 9504 bins with zero EVS (invariable) and
16 876 bins with an EVS of >1.1 (variable). Notably, enhancers
and promoters, respectively, show consistently high and low
average EVSs across the eight cell types, which indicates the
intrinsic difference in variability of their epigenetic states
(Supplementary Table S5). Transcribed regions show consistently
high EVSs, which may be due to alternative splicing in different
cell types. It is not surprising that the average EVSs of
repressed regions vary among cell types because repressed regions
in one cell type might be active in other cell types.
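
To make the EVS concrete, the following Python sketch computes the
entropy of the labels assigned to one bin across eight cell types.
The label strings and the two example bins are hypothetical; only
the entropy formula and the >1.1 threshold come from the text.

```python
import math
from collections import Counter


def evs(labels):
    """Entropy (natural log) of the label frequencies of one bin."""
    n = len(labels)
    counts = Counter(labels)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())


# Hypothetical bin with the same label in all eight cell types.
invariable_bin = ["promoter"] * 8

# Hypothetical bin whose label changes between cell types.
variable_bin = (["enhancer"] * 3 + ["promoter"] * 2
                + ["transcribed"] * 2 + ["background"])

print(evs(invariable_bin))        # 0.0: identical labels carry no variation
print(evs(variable_bin) > 1.1)    # True: counted as variable in the text
```

With five possible labels the score is bounded above by ln 5 (about
1.61), so the threshold of 1.1 selects bins whose labels are spread
over several categories.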

Checking the genes associated with these annotated regions, we
found that epigenetically invariable regions are related to
housekeeping functions, such as promoters related to 'RNA
processing' and 'cell cycle', transcribed regions to 'RNA
splicing' and 'translation', enhancers to 'cell death' and 'actin
cytoskeleton organization' and repressed regions to 'neuron
differentiation'. In contrast, the epigenetically variable
regions are related to stimulus response, as shown by the
enriched gene ontology (GO) terms found by DAVID (33), such as
'cell adhesion' and 'cell-cell signalling' (Table 2).

Enhancers dominate cell type specificity 

To investigate which functional regions were critical to
determine cell specificity, we calculated the number of
mismatches of ChroModule states (promoter, enhancer, transcribed,
repressed or background) between cell types and clustered the
eight cell types using this epigenomic distance as the metric.
When using enhancers to compute the epigenomic distance, the
pluripotent H1 cell is distinct from the remaining differentiated
cells, and the epidermal (Hmec and Nhek) and lymphocytic cells
(K562 and GM12878) are close to each other (Figure 3). This
clustering reflects the cell-type similarity much better than the
clusters generated using promoter alone, promoter plus enhancer
or all the annotated regions (Supplementary Figure S8). This
observation confirmed that enhancers dictate cell-type
specificity (2).
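
The epigenomic distance can be sketched as a count of mismatching
labels over aligned bins. The Python example below is illustrative
only: the per-bin label strings (E, P, T, B for enhancer, promoter,
transcribed and background) are hypothetical, and the hierarchical
clustering step performed with Pvclust is not reproduced.

```python
from itertools import combinations


def mismatch_distance(labels_a, labels_b):
    """Number of bins whose labels differ between two cell types."""
    if len(labels_a) != len(labels_b):
        raise ValueError("annotations must cover the same bins")
    return sum(a != b for a, b in zip(labels_a, labels_b))


def pairwise_distances(annotations):
    """Mismatch distance for every unordered pair of cell types."""
    return {
        (a, b): mismatch_distance(annotations[a], annotations[b])
        for a, b in combinations(sorted(annotations), 2)
    }


# Hypothetical annotations: one label per 100-bp bin and cell type.
annotations = {
    "Hmec": "EEBPB",
    "Nhek": "EEBPT",
    "H1": "BBEPB",
}

distances = pairwise_distances(annotations)
closest = min(distances, key=distances.get)
print(distances)
print(closest)  # ('Hmec', 'Nhek')
```

To restrict the distance to enhancers, one option (an assumption,
not stated in the text) is to recode every label as enhancer versus
non-enhancer before counting mismatches.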

We then conducted GO term analysis (33) on the promoters and the
closest genes of the enhancers predicted in all or only one cell
type (Table 3 and Supplementary Table S6). The common enhancers
across eight cell types are related to 'cell death', consistent
with the enriched functions of invariable enhancers with low EVS.
Checking the functions related to cell-type-specific enhancers,
we observed a strong correlation between the enriched GO terms
and the function of the cell. For example, 'lymphocyte
activation' is the most significant GO term in the lymphoblastoid
cell GM12878. The functions of common promoters, such as 'RNA
processing' and 'cellular macromolecule catabolic process', are
essential to the cell. The enriched GO terms associated with the
cell-type-specific promoters are less well associated with cell
specificity than enhancers.

Table 2. Functions of the genes

Figure 3. The epigenetic distance between cell types calculated based
on the enhancer segmentation using the Pvclust R package (41).
Clusters with unbiased P > 0.95 are indicated by the rectangles. See
Supplementary Figure S8 for other clusters.

Comparative methods found cell-type-specific master
regulators

We searched for enriched sequence motifs in the enhancers using
Homer (29) (Figure 4). We restricted the search to the open
chromatin regions marked by the third state of the enhancer HMM.
In the common enhancers, we found the enrichment of the motif
recognized by Fos, which regulates diverse biological processes
from 'proliferation and differentiation' to 'defence against
invasion and cell damage' (41). For cell-type-specific enhancers,
we also found motifs recognized by TFs that specifically function
in the corresponding cells. For example, a motif identified in
the H1 cell is similar to the motif of Oct4 (15), a master
regulator of embryonic stem cells (Figure 4). Indeed, we observed
a much larger portion of Oct4 ChIP-seq-binding sites in the
H1-specific enhancers with an open chromatin region (78 of 504
H1-specific enhancers with bins marked as the third state in the
enhancer HMM) compared with that in all the enhancers (564 of
24 280 enhancers) (P < 10^-130). Another example is the Pu.1
motif found in GM12878, which is consistent with the observed
enrichment of ChIP-seq peaks of Pu.1 in the GM12878 enhancers
(13 031 of 39 662 predicted enhancers) or those in the
GM12878-specific enhancers with an open chromatin (2423 of 4998).
The third example is the identification of the GATA1 motif in
K562, whose functional roles in K562 were previously reported in
the literature (42). Genome-wide ChIP-seq analysis of GATA1 also
showed the enrichment of peaks in K562-specific enhancers (2632
of 6315 enhancers with an open chromatin, P < 10^-130).


DISCUSSION 

With the fast accumulation of epigenomic data, there is an
urgent need for global analysis of such data and annotation of
the genome in a cell-type-specific manner. The ChroModule method
developed in this study provides a useful tool to label
functional regions based on epigenetic information. ChroModule
has several unique features. First, the design of ChroModule
allows separate training of individual modules and then linking
these modules to build a full model, which significantly reduces
the complexity of model tuning. The modular design of HMMs has
been successfully applied to biological sequence analysis
(43-46). A modular HMM not only allows easy interpretation of the
decoding results but also often achieves higher prediction
accuracy than non-modular models, especially in the biology
domain, because it is non-trivial to automatically learn the HMM
structure from complicated and noisy biological data (47,48). In
addition, the modular design allows easy extension of the model
to represent new biological observations by including additional
modules.

Second, ChroModule models the sequencing data directly, which
avoids the arbitrariness of selecting the cut-off for
discretization as done in ChromHMM. As shown in this and previous
studies (12,15,22), mixtures of Gaussians capture fine spatial
patterns that can greatly facilitate follow-up analysis, such as
searching for motifs recognized by TFs in the open chromatin
states (e.g. the third state of the enhancer HMM module). In
addition, as shown in Supplementary Figure S3, diverse chromatin
patterns can be represented by a single module in ChroModule.
Table 3. Cell-type-specific enhancers and the functions of the closest genes

Type               Number of           Number of       GO terms
                   cell-type-specific  assigned genes
                   enhancers

Common enhancers   522                 435             Cell death (43; 2.2E-5)
                                                       Apoptosis (38; 1.2E-5)

H1 specific        21 353              8274            Human embryonic stem cell
                                                       Neuron differentiation (276; 1.7E-23) (a)
                                                       Cell morphogenesis involved in differentiation (169; 7.0E-20)

GM12878 specific   [illegible]         [illegible]     Lymphoblastoid
                                                       Regulation of lymphocyte activation (107; 9.5E-14) (a)
                                                       Regulation of leucocyte activation (116; 1.2E-13)
                                                       Regulation of T cell activation (85; 3.5E-11)

Hmec specific      10 224              5159            Human mammary epithelial cell
                                                       Cell motion (173; 3.9E-6) (a)
                                                       Cell adhesion (236; 3.9E-6)

Hsmm specific      10 934              5684            Normal human skeletal muscle myoblasts
                                                       Skeletal system development (144; 8.0E-11) (a)

Huvec specific     11 383              5492            Human umbilical vein endothelial cell
                                                       Enzyme-linked receptor protein signalling pathway (145; 2.8E-8) (a)
                                                       Blood vessel development (100; 9.8E-5)

K562 specific      15 827              7287            Leukaemia
                                                       Positive regulation of leucocyte proliferation (41; 9.4E-5) (a)
                                                       Positive regulation of lymphocyte proliferation (40; 9.9E-5)

Nhek specific      8356                4959            Normal human epidermal keratinocytes
                                                       Cell morphogenesis involved in differentiation (104; 1.1E-7) (a)
                                                       Neuron projection morphogenesis (94; 5.5E-8)

Nhlf specific      16 691              6377            Normal human lung fibroblasts
                                                       Cell motion (219; 8.9E-11) (a)
                                                       Lung development (53; 1.5E-4)

Because multiple enhancers can be assigned to the same gene, the number of assigned genes is often smaller than that of enhancers. We used DAVID
(33) for GO analysis. Inside the parentheses are the numbers of genes in each term and the Benjamini-Hochberg adjusted P-value. We selected three
biological processes from the significant categories.
(a) The most significantly enriched biological processes.
Figure 4. Enriched motifs found by Homer (29). [Motif logos not reproduced; surviving panel label: Hsmm, MyoD (1e-65).]





Third, ChroModule has a small number (five) of functional
categories, which avoids the non-trivial process of determining
the number of HMM states required in the unsupervised learning
approach. In addition, the output annotation of ChroModule is
easy to interpret without relying on other data or knowledge to
annotate the HMM states, as is needed in unsupervised learning
methods (36).

Unsupervised learning methods have been applied to annotating
epigenomes (23,36,49) and searching for combinatorial patterns of
epigenetic modifications (50). We conducted a bona fide
assessment of the performance of ChroModule, especially on the
predicted promoters and enhancers, using p300- and TF-binding
peaks. When compared with the state-of-the-art method ChromHMM,
ChroModule consistently showed superior performance in
identifying enhancers in different cell types. Unsupervised
learning methods can uncover novel features of epigenetic
modifications, whereas supervised learning methods can take
advantage of the existing knowledge to extract information of
interest more accurately. We thus believe ChroModule provides a
powerful tool that is complementary to the unsupervised methods,
such as ChromHMM and Segway (23,49), in annotating epigenetic
states of the cell.

Given the fixed number of functional categories, the epigenetic
annotations made by ChroModule can be compared directly across
cell types. Taking advantage of such a comparative annotation, we
defined the EVS to quantitatively measure the variability of
epigenetic state. The functional analyses based on the EVS showed
that EVS is a useful metric to define cell-type-specific
regulation. In particular, enhancers showed relatively higher
EVSs than the other functional regions, and enhancers dominate
the cell-type specificity. Furthermore, cell-type-specific
enhancers are also enriched with motifs of TFs that play
important roles in the corresponding cell.

Interestingly, we found that some promoters and enhancers are
dormant in one cell type but become active in other cell types,
as they do not overlap with DHS peaks in the same cell type but
do overlap with DHS peaks in other cell types. TF binding or
epigenetic marks priming for future gene activation in
differentiation has been observed in embryonic stem (ES) cells
(51,52). Our observation suggests that epigenetic priming may be
more widespread, even in differentiated cells. This hypothesis
awaits further experimental testing.

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Tables 1-6, Supplementary Figures 1-8, 
Supplementary Methods and Supplementary Reference 
[53]. 